import Pkg
Pkg.activate(@__DIR__)
Pkg.instantiate()
Homework 3 Solutions
BEE 4850/5850
Overview
Instructions
The goal of this homework assignment is to practice developing and working with probability models for data.
- Problem 1 asks you to quantify uncertainties for a simple model using Bayesian statistics and rejection sampling.
- Problem 2 asks you to use Monte Carlo simulation to calculate annual expected flooding damages.
- Problem 3 (only required for students in BEE 5850) asks you to conduct cost-benefit analyses of housing elevation levels under future flooding uncertainty using Monte Carlo simulations.
Load Environment
The following code loads the environment and makes sure all needed packages are installed. This should be at the start of most Julia scripts.
The following packages are included in the environment (to help you find similar packages in other languages). The code below loads these packages for use in the rest of the notebook; the desired functionality for each package is commented next to it.
using Random # random number generation and seed-setting
using DataFrames # tabular data structure
using CSV # reads/writes .csv files
using Distributions # interface to work with probability distributions
using Plots # plotting library
using StatsBase # statistical quantities like mean, median, etc
using StatsPlots # some additional statistical plotting tools
using Optim # optimization tools
Random.seed!(500)
Problems
Scoring
- Problem 1 is worth 13 points.
- Problem 2 is worth 7 points.
- Problem 3 is worth 5 points.
Problem 1
The probability model for the cheating responses is based on a Binomial distribution, which reflects the number of successes (“yes” responses in this case) in \(n\) trials with probability \(p\). What is the probability of getting a “yes”? This can happen in two ways. First, with probability \(0.5\), the student gets a heads in the first toss and answers honestly, which means they will answer “yes” with probability \(\theta\). Otherwise, they flip the coin again and answer “yes” if the coin comes up heads, an outcome which occurs with probability 0.25. So the probability of a “yes” is \(p=0.5\theta + 0.25\). Therefore, the probability model is:
\[ \begin{aligned} y &\sim \text{Binomial}(100, 0.5\theta + 0.25) \\ \theta &\sim \text{Beta}(3, 40). \end{aligned} \]
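As a quick sanity check on this derivation (not part of the solution itself), we can simulate the two-coin mechanism directly for an arbitrary cheating rate and confirm that the "yes" probability matches \(0.5\theta + 0.25\); the value of θ_true below is hypothetical, chosen only for this check.

```julia
using Random, Statistics

Random.seed!(1)
θ_true = 0.1  # hypothetical cheating rate, for illustration only
# First coin heads (prob. 0.5): answer honestly, i.e. "yes" with prob. θ.
# Otherwise flip again and answer "yes" if that coin lands heads (prob. 0.5).
answer() = rand() < 0.5 ? rand() < θ_true : rand() < 0.5
p_yes = mean(answer() for _ in 1:1_000_000)  # should be ≈ 0.5 * 0.1 + 0.25 = 0.30
```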
Now we can use a grid approximation to compute and visualize the posterior given the data of 40 “yes” answers. We will still do our calculations on the log scale to avoid issues with numerical underflow, even though we will plot the posterior on the density scale to help set up our rejection sampling.
lprior(θ) = logpdf(Beta(3, 40), θ)
llikelihood(θ, y) = logpdf(Binomial(100, 0.5 * θ + 0.25), y)
lposterior(θ, y) = lprior(θ) + llikelihood(θ, y)
θ = 0:0.005:1
lp = lposterior.(θ, 40)
plot(θ, exp.(lp))
To draw samples with rejection sampling, we have a few options. As we can see from Figure 1, the unnormalized posterior density is bounded above by 0.03, so we can use simple uniform rejection sampling with \(M = 0.03\). This will not be very efficient, since the posterior is close to zero along most of the unit interval, but it is relatively straightforward, so we’ll use it.
post_samples = zeros(1_000)
# we will use a counter to keep count of how many samples we've kept
kept_samples = 0
while kept_samples < 1_000
# generate a proposal
x = rand(Uniform(0, 1))
u = rand(Uniform(0, 1))
if 0.03 * u < exp(lposterior(x, 40))
kept_samples += 1
post_samples[kept_samples] = x
end
end
histogram(post_samples, label=false, xlabel="θ", ylabel="Count")
Note that we are accepting samples one at a time; we could also batch the proposals.
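For example, a batched version of this sampler might look like the following sketch (rejection_sample_batched is a hypothetical helper, not part of the assignment); it proposes in blocks and vectorizes the accept/reject test.

```julia
# Draw n samples from an (unnormalized) density on [0, 1] whose values are
# bounded above by M, proposing in blocks instead of one at a time.
function rejection_sample_batched(logtarget, M, n; batch=1_000)
    samples = Float64[]
    while length(samples) < n
        x = rand(batch)                       # uniform proposals on [0, 1]
        u = rand(batch)                       # uniform accept/reject draws
        keep = M .* u .< exp.(logtarget.(x))  # vectorized rejection test
        append!(samples, x[keep])
    end
    return samples[1:n]                       # trim any overshoot
end
```

With the posterior above, this would be called as rejection_sample_batched(θ -> lposterior(θ, 40), 0.03, 1_000).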
Comparing Figure 2 to Figure 1, the distributions look similar. So we don’t need more samples, though it never hurts to sample more. Now we can summarize these samples to find the mean and 90% credible interval.
θ_mean = mean(post_samples)
θ_ci = quantile(post_samples, [0.05, 0.95])
@show θ_mean * 100;
@show θ_ci;
θ_mean * 100 = 11.594260902070134
θ_ci = [0.03840318830527911, 0.20848848016469992]
So our mean cheating probability given the data is 12%, with a 90% credible interval of (4%, 21%). This suggests a generally low rate of cheating, though whether the upper limit of the 90% credible interval is acceptable is subjective.
One question we should always ask ourselves is the extent to which our inferences from a Bayesian model are driven by the prior, rather than the likelihood. Let’s redo our analysis with a less informative (more permissive) prior to see if we end up with similar inferences. We can try a Beta(2, 10) distribution, as shown in Figure 3, though there are many other choices.
θ = 0:0.01:1
plot(θ, pdf.(Beta(3, 40), θ), lw=2, label="Original Prior", xlabel="θ", ylabel="Density")
plot!(θ, pdf.(Beta(2, 10), θ), lw=2, label="Alternate Prior")
Let’s repeat the entire analysis from before and see what the posterior mean and 90% credible interval look like. First we need to find the new upper bound on the posterior density.
lprior_alt(θ) = logpdf(Beta(2, 10), θ)
lposterior_alt(θ, y) = lprior_alt(θ) + llikelihood(θ, y)
θ = 0:0.005:1
lp = lposterior_alt.(θ, 40)
plot(θ, exp.(lp))
So using \(M=0.2\) will work, which we will do to finish the sampling.
post_samples = zeros(1_000)
# we will use a counter to keep count of how many samples we've kept
kept_samples = 0
while kept_samples < 1_000
# generate a proposal
x = rand(Uniform(0, 1))
u = rand(Uniform(0, 1))
if 0.2 * u < exp(lposterior_alt(x, 40))
kept_samples += 1
post_samples[kept_samples] = x
end
end
θ_mean = mean(post_samples)
θ_ci = quantile(post_samples, [0.05, 0.95])
@show θ_mean * 100;
@show θ_ci;
θ_mean * 100 = 23.7464694628426
θ_ci = [0.1137444002256321, 0.37512811469430585]
We can see that the use of this alternate prior has resulted in a much higher estimate and larger uncertainties. This suggests that our prior is doing a lot of work in how we interpret the outcomes of the testing procedure, and the resulting ambiguity may be too large to get a sense of how much of the class is actually cheating: with our original prior we estimated that 12% of the class was cheating, while with this broader prior we estimate 24%, with values up to around 40% plausible.
Problem 2
First, we need to fit the extreme water distribution to the provided data.
# load data
dat = CSV.read("data/water_depths.csv", DataFrame)
# fit LogNormal distribution
wl_dist = fit(LogNormal, dat.depths)
The fit command from Distributions.jl uses the method of moments for some distributions (such as the LogNormal distribution) and optimizes the logpdf for others.
Distributions.LogNormal{Float64}(μ=0.14057503250689243, σ=0.25281750343529374)
We can see how the fitted distribution matches the data.
histogram(dat.depths, xlabel="Annual Max Water Level (m)", ylabel="Density", normalize=:pdf, label="Data")
plot!(wl_dist, color=:red, lw=2, label="Fitted Distribution")
normalize=:pdf scales the histogram to a density scale instead of counts, so the plotted density will have the right scale.
If you used an optimization routine to find the lognormal parameters, the scale parameter will differ at the thousandths place. It appears this is enough to shift all of the subsequent estimates downward by around $3,000.
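For reference, the closed-form fit is easy to reproduce by hand: the logs of LogNormal data are Normal, so the parameters come from the moments of the log-data. The sketch below uses synthetic data as a stand-in for dat.depths, with the true parameters chosen for illustration.

```julia
using Random, Statistics

Random.seed!(1)
# synthetic stand-in for log.(dat.depths), for illustration only
log_depths = 0.14 .+ 0.25 .* randn(100_000)
depths = exp.(log_depths)
μ̂ = mean(log.(depths))                     # estimate of the log-scale mean
σ̂ = sqrt(mean((log.(depths) .- μ̂) .^ 2))  # uncorrected (MLE-style) std
```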
Now let’s draw samples. Note that, at some point, we will need to shift these by 1.5m to capture the flooding depth at the house, so we may as well do that here.
nsamp = 10_000
wl_samples = rand(wl_dist, nsamp)
wl_samples_shift = wl_samples .- 1.5
histogram(wl_samples, xlabel="Annual Max Water Level (m)", ylabel="Count", label="Stream", alpha=0.5)
histogram!(wl_samples_shift, label="House", alpha=0.5)
Let’s define our depth-damage function, \[d(h) = \mathbb{I}[h > 0] \frac{1}{1+\exp(-k(h-x_0))},\] with \(k = 1.25\) and \(x_0 = 2\).
h = 0:0.01:5
dd(h, k, x₀) = (h .> 0) * 1 ./ (1 .+ exp.(-k * (h .- x₀)))
plot(h, dd(h, 1.25, 2), lw=2, xlabel="Flood Depth (m Above Base Elevation)", ylabel="Fraction of Value Damaged", legend=false)
ylims!(0, 1)
Finally, to conduct the Monte Carlo simulation we propagate our shifted water level samples through the depth-damage function. Since the depth-damage function dd gives us the fraction of value that is damaged, we also need to multiply by the value of the house.
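Before doing so, two reference points of the depth-damage curve are worth checking (a quick standalone sketch; the function definition is repeated so the snippet runs on its own).

```julia
# depth-damage function: zero for h ≤ 0, logistic in h above base elevation
dd(h, k, x₀) = (h .> 0) .* 1 ./ (1 .+ exp.(-k .* (h .- x₀)))

dd(2.0, 1.25, 2)   # = 0.5: half the value is damaged at h = x₀ = 2 m
dd(1e-6, 1.25, 2)  # ≈ 0.076: even a trace of water damages ~7.6% of value
```

The jump at \(h = 0\) is what makes the damage distribution non-smooth: any water in the house immediately damages about 7.6% of its value.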
dd_samples = 400_000 * dd(wl_samples_shift, 1.25, 2)
p1 = histogram(dd_samples, xlabel="Annual Damaged Value (\$)", ylabel="Count")
p2 = histogram(dd_samples[dd_samples .> 0], xlabel="Annual Damaged Value (\$)", ylabel="Count")
xlims!(p2, (0, 90_000))
display(p1)
display(p2)
Notice two things in Figure 7 which are characteristic of flood damage distributions: (1) the vast majority of samples (85%, in this case) are zero, and (2) the transition between zero and non-zero samples is not “smooth” (there is a jump) due to the large damages from any water penetration into the house.
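The 85% zero-damage share can also be checked against the fitted distribution: a zero-damage year is exactly one where the stream level stays below the 1.5 m offset. A standalone sketch of this check, using the fitted parameters reported above (μ ≈ 0.1406, σ ≈ 0.2528):

```julia
using Random, Statistics

Random.seed!(1)
μ, σ = 0.1406, 0.2528        # fitted LogNormal parameters from above
levels = exp.(μ .+ σ .* randn(1_000_000))
p_dry = mean(levels .<= 1.5) # ≈ 0.85, matching the zero-damage share
```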
To determine if this was enough samples, let’s monitor how the Monte Carlo estimate and the standard error changes with each additional sample, which we can see in Figure 8.
dd_est = zeros(nsamp)
dd_est[1] = dd_samples[1]
dd_se = zeros(nsamp)
for i = 2:nsamp
dd_est[i] = ((i - 1) * dd_est[i - 1] + dd_samples[i]) / i
dd_se[i] = std(dd_samples[1:i]) / sqrt(i)
end
plot(1:nsamp, dd_est, ribbon=(1.96 * dd_se), ylims=(0, 1e4), legend=false, fillalpha=0.3)
From Figure 8, it looks like the estimate has converged with a relatively small confidence interval. Namely, our estimate of the expected damages in one year is $5976 with a standard error of $149, resulting in a 95% confidence interval of $(5683, 6268).
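As a side note, the recursion used for dd_est above is just the cumulative mean written incrementally; a tiny standalone check:

```julia
x = [4.0, 2.0, 6.0, 8.0]
est = zeros(length(x))
est[1] = x[1]
for i in 2:length(x)
    # running mean: previous mean re-weighted, plus the new sample
    est[i] = ((i - 1) * est[i - 1] + x[i]) / i
end
est # == [4.0, 3.0, 4.0, 5.0], the cumulative means of x
```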
Should we add more samples? There’s no “right” answer; it depends on the desired level of precision. Is the roughly ±$300 half-width of the 95% confidence interval sufficient to understand the relative risk? Probably, if we’re talking about a $400,000 house, but if more precise investments under consideration fell into that rough range, we might want a tighter estimate. On the other hand, other errors are introduced throughout the process, starting with the approximation of the streamflow data by the fitted distribution and including any approximations in the depth-damage relationship; the Monte Carlo error is likely smaller than those.
Problem 3
Now we want to calculate the net present value of a given damage sample over a 30 year period.
# dd is the depth-damage sequence (including scaling for value)
# γ is the discount rate
function npv(damages, γ)
T = length(damages)
dam_discount = damages .* (1 .- γ).^(0:T-1)
return sum(dam_discount)
end
npv (generic function with 1 method)
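With constant damages, this NPV reduces to a geometric series, which gives a quick correctness check (the function is repeated so the snippet stands alone; the values of d, γ, and T below are arbitrary):

```julia
# net present value with discount factor (1 - γ) per year
function npv(damages, γ)
    T = length(damages)
    return sum(damages .* (1 .- γ) .^ (0:T-1))
end

d, γ, T = 1_000.0, 0.04, 30
npv(fill(d, T), γ)       # matches the closed form below
d * (1 - (1 - γ)^T) / γ  # geometric series: d (1 - (1-γ)^T) / γ
```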
Our Monte Carlo samples need to be 30-year sequences of water levels. We’ll start with 10,000 samples, and check if the estimated expected NPV converges.
T = 30
nsamps = 10_000
wl_samples = rand(wl_dist, (nsamps, T)) .- 1.5
dd_samples = mapslices(row -> npv(400_000 * dd(row, 1.25, 2), 0.04), wl_samples; dims=2)
dd_est = zeros(nsamps)
dd_est[1] = dd_samples[1]
dd_se = zeros(nsamps)
for i = 2:nsamps
dd_est[i] = ((i - 1) * dd_est[i - 1] + dd_samples[i]) / i
dd_se[i] = std(dd_samples[1:i]) / sqrt(i)
end
plot(1:nsamps, dd_est, ribbon=(1.96 * dd_se), ylims=(9e4, 1.3e5), legend=false, fillalpha=0.3)
For now that looks alright; we’ll keep including the Monte Carlo standard error as we incorporate elevation heights to see how the uncertainty impacts the cost-benefit analysis.
Now we need to redo the analysis with varying levels of elevation. We’ll plot the resulting estimates (including the standard error) along with the construction costs and the net benefits (NPV of benefits - costs). Note that the benefits are the reduction in damages relative to the baseline, so we should compute the Monte Carlo estimate of this difference for a given sequence of water samples.
h = 0:0.5:5
# construction cost function
C(h) = (h .> 0) * (100_000 + 2_000 * h)
costs = C.(h)
benefits_est = zeros(length(h))
benefits_se = zeros(length(h))
for (i, l) in enumerate(h)
dd_samples_baseline = mapslices(row -> 400_000 * npv(dd(row, 1.25, 2), 0.04), wl_samples; dims=2)
dd_samples_heighten = mapslices(row -> 400_000 * npv(dd(row, 1.25, 2), 0.04), wl_samples .- l; dims=2)
dd_samples_benefits = dd_samples_baseline - dd_samples_heighten
benefits_est[i] = mean(dd_samples_benefits)
benefits_se[i] = std(dd_samples_benefits) / sqrt(nsamps)
end
plot(h, benefits_est .- costs, marker=(:circle, 3), ribbon = 1.96 * benefits_se, fillalpha=0.3, color=:blue, xlabel="ΔH (m)", ylabel="Net Benefits (\$)", label=false)
hline!([0], color=:red, linestyle=:dot, label=false)
We can see from Figure 10 that the only heightening level that consistently provides positive net benefits is 1.0m. Heightening by 1.5m is also expected to provide positive net benefits, but the lower end of its 95% confidence interval falls below the zero line. In any case, the expected net benefits from 1.0m exceed those from 1.5m, so we would likely recommend 1.0m.
If you used an optimization routine to find the lognormal parameters, the bias from Problem 2 will result in no heightening level passing the cost-benefit analysis, though the general trend in how the net benefits change with subsequent heightening levels will be similar.